Import red and white wine data
First, I want to see the distribution of the data. I'm not sure whether there is an easier way to do this than copying and pasting the same code many times, but here is what I did to plot a histogram for each variable.
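One possible way to avoid the copy/paste is to reshape the data to long format and let ggplot2 draw one panel per variable. This is only a sketch: the `wine` data frame below is made-up stand-in data with three hypothetical columns, not the real data set, and `bins = 30` is an arbitrary choice.

```r
library(ggplot2)

# Hypothetical stand-in for the white wine data (assumption, not real data).
set.seed(1)
wine <- data.frame(
  alcohol        = rnorm(200, mean = 10.5, sd = 1.2),
  residual.sugar = rexp(200, rate = 1 / 6),
  sulphates      = rnorm(200, mean = 0.49, sd = 0.11)
)

# Reshape to long format: one row per (variable, value) pair.
# stack() returns two columns: `values` and `ind` (the source column name).
wine_long <- stack(wine)

# One histogram per variable; free scales because the units differ.
p <- ggplot(wine_long, aes(x = values)) +
  geom_histogram(bins = 30) +
  facet_wrap(~ ind, scales = "free")
print(p)
```

Setting `bins` (or `binwidth`) explicitly should also silence the "binwidth defaulted to range/30" message.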
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
It looks like a few variables may have long tails, so I log-transformed them.
Residual sugar is interesting: it looks bimodal, as if a wine is either sweet or not sweet. I'm curious what the red wine data looks like.
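A small sketch of what the log transform does, using made-up values (the vectors below are assumptions for illustration). The thing to watch for is an exact zero, because `log(0)` is `-Inf`, which may be where "non-finite values" warnings in later plots come from.

```r
# Hypothetical values: sugar is strictly positive, citric acid has one zero.
residual.sugar <- c(0.6, 1.2, 5.2, 9.9, 65.8)
citric.acid    <- c(0.00, 0.27, 0.32, 0.39)

log_sugar  <- log(residual.sugar)  # natural log; every value is finite
log_citric <- log(citric.acid)     # log(0) is -Inf

# Number of values a plot would have to drop
n_dropped <- sum(!is.finite(log_citric))
n_dropped

# log1p(x) = log(1 + x) stays finite at zero, at the cost of a shifted scale
log1p_citric <- log1p(citric.acid)
```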
It looks like only the white wine data has a distinct sweet group.
Alcohol is also interesting; the distribution is skewed toward lower alcohol.
Next, look at the quality distribution.
##
##    3    4    5    6    7    8    9
##   20  163 1457 2198  880  175    5
Most wines in the data are rated 5, 6, or 7. The table confirms this.
The paper in the footnote suggests that alcohol and sulphates are very important for white wine, so I look at their distributions.
## Observation
### Data Structure
The white wine data set has 4898 observations of 12 variables. The main feature is quality. All of the variables except quality are continuous.
Sugar level has an interesting bimodal pattern. I was not able to find a similar pattern in the red wine data set.
Alcohol level is skewed toward lower values, which suggests that lower-alcohol wines are more common.
The main feature I want to understand is quality and how the other variables affect it. First I run ggpairs to see whether there is any obvious trend.
There doesn't seem to be a trend. I looked at the bottom row, quality vs. the other variables, and did not see any clear pattern; nothing appears to correlate strongly. Then I looked at the last column. I think this shows the correlation between quality and each of the other variables, and from the numbers it looks like alcohol content has the strongest correlation.
Plot the top four variables for each quality level.
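To read the correlations directly instead of off the ggpairs grid, something like the following might work. It is a sketch on simulated data: the `wine` data frame and the way `quality` is generated are assumptions for illustration only, so the numbers mean nothing for the real data set.

```r
# Simulated stand-in data (assumption): quality loosely depends on alcohol.
set.seed(42)
n <- 300
alcohol   <- rnorm(n, mean = 10.5, sd = 1.2)
density   <- 1.0 - 0.002 * alcohol + rnorm(n, mean = 0, sd = 0.001)
sulphates <- rnorm(n, mean = 0.49, sd = 0.11)
quality   <- round(pmin(pmax(3 + 0.3 * alcohol + rnorm(n, 0, 0.8), 3), 9))
wine <- data.frame(alcohol, density, sulphates, quality)

# Pearson correlation of every other column with quality.
# cor() returns a one-column matrix here; [, 1] turns it into a named vector.
cors <- cor(wine[, names(wine) != "quality"], wine$quality)[, 1]

# Sort by absolute strength so the strongest relationship comes first.
cors_sorted <- cors[order(abs(cors), decreasing = TRUE)]
round(cors_sorted, 3)
```

Since quality is an ordinal score, passing `method = "spearman"` to `cor()` may be a reasonable alternative.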
## Warning: Removed 21 rows containing non-finite values (stat_boxplot).
It looks like alcohol goes up with quality, and the variance of residual sugar and citric acid gets smaller as quality increases. However, I'm surprised that there is no clear pattern for sulphates, which was the dominant variable according to the paper.
I feel that quality should depend on a combination of factors; for example, high sugar may need high citric acid, and so on. So I changed the plot to color by quality. It looks like alcohol increases and sulphates decrease as quality increases from 5 to 7. As quality increases, the citric acid level decreases.
Is there a relationship between these variables? I will look at just one quality level to see.
There are some obvious patterns, such as free vs. total sulfur dioxide, and some less obvious ones, such as density vs. alcohol.
Create a factor for sweetness to separate the two types of wine for more detailed analysis. There doesn't seem to be a very obvious pattern. Some conjectures:
* Sweet wine tends to be less alcoholic.
* Sweet wine tends to have a tighter sulphate range.
* Sweet wine tends to have more citric acid.

I also want to see how the sweetness factor influences quality.
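A sweetness factor like this could be built with `cut()`. This is a sketch only: the sugar values are made up, and `sweet_cutoff` is a hypothetical threshold (in the same units as residual sugar), not the value actually used in this analysis.

```r
# Hypothetical residual sugar values (assumption, not real data).
residual.sugar <- c(1.1, 2.4, 4.9, 5.0, 8.7, 14.2)

# Hypothetical cutoff between the two modes; pick it from the histogram.
sweet_cutoff <- 5

# cut() uses right-closed intervals by default, so a value equal to the
# cutoff falls in the "not sweet" group.
sweet <- cut(residual.sugar,
             breaks = c(-Inf, sweet_cutoff, Inf),
             labels = c("not sweet", "sweet"))
table(sweet)
```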
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
## -0.5108  0.5306  1.6490  1.4810  2.2930  4.1870
##       sweet        m
## 1 not sweet 5.972541
## 2     sweet 5.783971
The pattern is similar, but sweet wine seems to have a slightly lower rating on average (5.78 vs. 5.97).
How about citric acid? This was mentioned in the paper.
##       sweet    citric sulphates  alcohol  quality
## 1 not sweet 0.3256230 0.4965902 11.01900 5.972541
## 2     sweet 0.3426973 0.4831530 10.01323 5.783971
I didn't see a big difference in the citric acid pattern either. But from the summary, it looks like sweet wine may have slightly higher citric acid and lower alcohol.
Run t-tests to see whether the differences are significant.
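A per-group summary of this kind can be produced in base R with `aggregate()`. The sketch below uses a tiny made-up data frame (an assumption for illustration), so the means are not the ones from the real data.

```r
# Tiny made-up example (assumption, not real data).
wine <- data.frame(
  sweet   = factor(c("not sweet", "not sweet", "sweet", "sweet", "sweet")),
  quality = c(6, 7, 5, 6, 6),
  alcohol = c(11.2, 12.0, 9.5, 10.1, 9.8)
)

# Mean of each listed variable within each sweetness group.
group_means <- aggregate(cbind(quality, alcohol) ~ sweet,
                         data = wine, FUN = mean)
group_means
```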
t.test(whiteWineDf$citric.acid~whiteWineDf$sweet)
##
## Welch Two Sample t-test
##
## data: whiteWineDf$citric.acid by whiteWineDf$sweet
## t = -4.95, df = 4880.389, p-value = 7.671e-07
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.02383667 -0.01031206
## sample estimates:
## mean in group not sweet mean in group sweet
## 0.3256230 0.3426973
t.test(whiteWineDf$alcohol~whiteWineDf$sweet)
##
## Welch Two Sample t-test
##
## data: whiteWineDf$alcohol by whiteWineDf$sweet
## t = 31.3237, df = 4863.078, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 0.9428268 1.0687235
## sample estimates:
## mean in group not sweet mean in group sweet
## 11.01900 10.01323
t.test(whiteWineDf$sulphates~whiteWineDf$sweet)
##
## Welch Two Sample t-test
##
## data: whiteWineDf$sulphates by whiteWineDf$sweet
## t = 4.1251, df = 4825.791, p-value = 3.769e-05
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 0.007051093 0.019823295
## sample estimates:
## mean in group not sweet mean in group sweet
## 0.4965902 0.4831530
t.test(whiteWineDf$residual.sugar~whiteWineDf$sweet)
##
## Welch Two Sample t-test
##
## data: whiteWineDf$residual.sugar by whiteWineDf$sweet
## t = -100.2472, df = 2930.601, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -8.471701 -8.146657
## sample estimates:
## mean in group not sweet mean in group sweet
## 2.221557 10.530736
All four differences between sweet and not sweet wine are statistically significant. The largest are in residual sugar (expected, since that is how the groups were defined) and alcohol (about 1.0 lower for sweet wine); the differences in citric acid and sulphates are small in absolute terms.
According to the paper, sulphates are the dominant factor in determining the quality of the wine. However, in my exploration, I found that alcohol level, residual sugar variance, and citric acid variance are more clearly related to quality. I think this may be because I'm looking at one variable at a time and didn't use the more sophisticated techniques from the paper.
I was able to plot the top three variables by mapping two variables to x and y and one variable to color. I then faceted the plot by quality to identify patterns. It looks like alcohol increases and sulphates decrease as quality increases from 5 to 7. As quality increases, the citric acid level decreases.
Also interesting are the differences in alcohol, citric acid, sulphates, and residual sugar between sweet and not sweet wine.
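The four t-tests above could probably be run in a loop and collected into one table, which also puts the size of each difference next to its p-value. This is a sketch on simulated data: the `wine` data frame, its group means, and the `vars` list are assumptions for illustration.

```r
# Simulated stand-in data (assumption): two groups of 200 wines each.
set.seed(7)
n <- 200
wine <- data.frame(
  sweet       = factor(rep(c("not sweet", "sweet"), each = n)),
  alcohol     = c(rnorm(n, 11.0, 1.2), rnorm(n, 10.0, 1.0)),
  citric.acid = c(rnorm(n, 0.33, 0.12), rnorm(n, 0.34, 0.12))
)

vars <- c("alcohol", "citric.acid")

# Run a Welch two-sample t-test (the t.test() default) for each variable
# and keep the mean difference, its 95% confidence interval, and the p-value.
results <- do.call(rbind, lapply(vars, function(v) {
  tt <- t.test(wine[[v]] ~ wine$sweet)
  data.frame(
    variable = v,
    diff     = unname(tt$estimate[1] - tt$estimate[2]),  # not sweet - sweet
    ci_low   = tt$conf.int[1],
    ci_high  = tt$conf.int[2],
    p_value  = tt$p.value
  )
}))
results
```

With thousands of rows, even a small difference can give a tiny p-value, so the `diff` column is worth reading alongside `p_value`.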
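A plot of this shape might look like the sketch below. The choice of which variable goes on x, y, and color is an assumption here, and the `wine` data frame is simulated, so the picture will not match the real one.

```r
library(ggplot2)

# Simulated stand-in data (assumption, not real data).
set.seed(3)
n <- 300
wine <- data.frame(
  alcohol     = rnorm(n, mean = 10.5, sd = 1.2),
  sulphates   = rnorm(n, mean = 0.49, sd = 0.11),
  citric.acid = pmax(rnorm(n, mean = 0.33, sd = 0.12), 0),
  quality     = sample(5:7, n, replace = TRUE)
)

# Two variables on the axes, a third as color, one panel per quality level.
p <- ggplot(wine, aes(x = alcohol, y = sulphates, colour = citric.acid)) +
  geom_point(alpha = 0.5) +
  facet_wrap(~ quality)
print(p)
```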
## [1] 0.9258881
## [1] 0.4487546
## [1] 6 5 7 8 4 3 9
About 93% of the white wines in the sample are rated 5, 6, or 7, and 45% are rated 6. We only observe ratings from 3 to 9. The distribution is similar for sweet and not sweet wine.
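These two shares can be checked from the quality counts in the table earlier in this report. The sketch below retypes those counts by hand; with the full data frame, something like `mean(whiteWineDf$quality %in% 5:7)` should presumably give the same number.

```r
# Quality counts copied from the table output earlier in this report.
quality_counts <- c("3" = 20, "4" = 163, "5" = 1457, "6" = 2198,
                    "7" = 880, "8" = 175, "9" = 5)

total <- sum(quality_counts)

# Share of wines rated 5, 6, or 7, and share rated exactly 6.
share_5_to_7 <- sum(quality_counts[c("5", "6", "7")]) / total
share_6      <- quality_counts[["6"]] / total

c(total = total, share_5_to_7 = share_5_to_7, share_6 = share_6)
```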
The sweetness of the white wine is bimodal. The distributions of citric acid, residual sugar, and alcohol differ between sweet and not sweet wine. The t-tests show that the differences are significant.
## Scale for 'y' is already present. Adding another scale for 'y', which will replace the existing scale.
## Warning: Removed 68 rows containing non-finite values (stat_boxplot).
Finally, there seems to be a clear relationship between alcohol, residual sugar, and quality. There should be a relationship between sulphates, citric acid, and quality, but it's not very clear in the graph. Citric acid variance tends to get smaller as quality increases, but it's hard to tell. The same goes for sulphates.
The white wine data set contains 4898 samples. However, most of the wines are in the 5, 6, and 7 quality range, so I decided it was better to limit the analysis to that subset because there may not be enough samples for the other quality levels. I'm not sure whether it was reasonable to drop part of the data; this may have caused me to miss an important trend.
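Restricting to the middle quality range can be done with `subset()`. A sketch on a tiny made-up data frame (an assumption for illustration), with a count of how many rows the filter drops:

```r
# Tiny made-up example (assumption, not real data).
wine <- data.frame(
  quality = c(3, 5, 6, 6, 7, 8, 9, 5),
  alcohol = c(9.1, 9.8, 10.4, 10.9, 11.6, 12.3, 12.9, 9.5)
)

# Keep only wines rated 5, 6, or 7.
wine_mid <- subset(wine, quality >= 5 & quality <= 7)

# How many rows were dropped by the filter
n_dropped <- nrow(wine) - nrow(wine_mid)
n_dropped
```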
I initially just jumped in and tried to determine the regression model. It was very difficult, so I went back and followed the process of single-variable, then multivariable analysis. Even then, it was still very difficult to determine which factors influence the quality of the wine. I also struggled with figuring out which variables should be log-transformed.
There are a lot of variables that interact with each other to determine the quality of the wine. This makes it difficult to predict quality with simple linear regression. I was also surprised not to see any obvious relationship between quality and sulphates, which was listed as the most important factor in the paper.
I think breaking down white wine into two sugar categories was a good move. It helped remove some of the noise. The variables that matter for white wine quality can be very different depending on the sugar level. The t-tests also confirmed that the two groups differ.
If I had more time, I would like to apply some of the techniques the paper discusses to understand the quality model better.